Practical - Week 4
(Department of Spatial Sciences)
2022-10-23
A brief introduction to Species Distribution Modelling (SDM)
A tool that aims to predict where a species could potentially occur, based on a limited set of observations.
It can also be used to estimate a species’ niche from its distribution.
It’s a huge field, primarily used in ecology and conservation.
It’s also a young field that changes and advances quickly, and it’s full of open problems and challenges.
ODMAP protocol (Zurell et al. 2020)
Introduction to species distribution modelling (SDM) in R by Damaris Zurell
Species distribution modelling practicals (Macroecology and global change course) by Damaris Zurell
ENM2020: A Free Online Course and Set of Resources on Modeling Species’ Niches and Distributions led by Town Peterson (Youtube playlist, Schedule and PDFs, Publication)
Best Practices in Species Distribution Modeling: A workshop in R by Adam Smith
Species distribution modeling in R by Robert Hijmans and Jane Elith
A very brief introduction to species distribution models in R by Jeff Oliver
SDM course by Bob Muscarella (very useful for finding other resources!)
Different SDM algorithms need different types of occurrence data. These are the three main approaches:
Ensemble methods:
If we perform many models, which one do we choose?
We could choose the one that’s best suited to our data, or that performs the best.
However, a more popular approach is performing ensemble models.
In this approach, predictions from multiple models are combined or averaged to produce a single ensemble prediction.
The most frequently used package for this is biomod2.
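As a minimal, self-contained sketch of the idea (simulated predictions, not from this practical), an ensemble can be an unweighted or weighted average of per-cell predictions from several models; the model names and AUC scores below are hypothetical:

```r
set.seed(42)
# Stand-ins for per-cell habitat suitability predicted by three hypothetical models
pred_glm <- runif(10)
pred_gam <- runif(10)
pred_rf  <- runif(10)
preds <- cbind(pred_glm, pred_gam, pred_rf)

# Unweighted ensemble: the mean prediction per cell
ensemble <- rowMeans(preds)

# Weighted ensemble: weight each model by an evaluation score (hypothetical AUCs)
auc_scores <- c(glm = 0.80, gam = 0.85, rf = 0.90)
w <- auc_scores / sum(auc_scores)
ensemble_w <- as.vector(preds %*% w)
```

In practice, packages such as biomod2 automate this combination (and the per-model evaluation that feeds the weights).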
Not all species’ observations are reliable; we always need to check their quality before using them.
We will use the rnaturalearth and ggplot2 packages to view the data, and the CoordinateCleaner package to clean the provided GBIF data.
Let’s inspect the downloaded occurrences: do they make sense?
All records seem to be located in the Iberian Peninsula 👍
We will use the function clean_coordinates() from CoordinateCleaner to test for duplicate records (duplicates), spatial outliers (outliers), records placed at country centroids (centroids), and records near research institutions (institutions). This process takes time, so we already provide the output.
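The call would look roughly like the sketch below. It is not run here (it needs the raw GBIF table and takes time), and the input object name is an assumption; the column and test names follow the CoordinateCleaner documentation:

```r
library(CoordinateCleaner)

# Sketch only: flag duplicates, outliers, country-centroid records and
# records near research institutions in the raw GBIF table
lynx_flags <- clean_coordinates(
  x = lynx_gbif_raw,            # hypothetical name for the raw GBIF data
  lon = "decimalLongitude",
  lat = "decimalLatitude",
  species = "species",
  tests = c("duplicates", "outliers", "centroids", "institutions")
)
```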
Let’s check the cleaned data set. First we load the cleaned data set.
Then we restrict the map to Western Europe and plot the cleaned occurrences on top.
Plot of the cleaned presences
Download the climatic data from WorldClim using the package geodata. We download the data for our area of study (Spain and Portugal). The meaning of each bioclimatic variable code can be found here.
Different SDM algorithms need different types of occurrence data (presence-only and presence-absence). True absence data are rarely available; however, we can artificially generate background or pseudo-absence points. For that, we will use the package dismo.
First, we prepare the occurrence data.
Then, prepare the input maps. In this case we are only going to consider the Iberian Peninsula. The rnaturalearth map at scale = 110 does not include the Canary Islands or the Azores (which we are not interested in). Therefore, we crop the WorldClim bioclim data to this area.
# Getting the map of Spain and Portugal
iberia_map <- rnaturalearth::ne_countries(
scale = 110,
type = "countries",
country = c("spain", "portugal"),
returnclass = "sf"
)
## Getting the extent of our area of study
geographic_extent <- st_bbox(iberia_map)
# Crop bioclim data to desired extent
bioclim_data <- crop(x = bioclim_data, y = geographic_extent)

dismo requires a raster as an input to extract pseudo-absences. We will rasterize the Iberia map using the bioclim data as a base.
We randomly generate pseudo-absences with the function randomPoints. This function avoids generating pseudo-absences where there are already presences.
Plotting the presences and pseudo-absences
Adding pseudo-absences to our presence dataframe
# Prepare the presences data to contain a column indicating 1 for presence
lynx_gbif_obs <- data.frame(lynx_gbif_obs, PA=1)
# Adding that same column to the pseudoabsence data
lynx_pseudoabs <- data.frame(random_points, PA = 0)
names(lynx_pseudoabs) <- c('decimalLongitude','decimalLatitude', 'PA')
# Binding those data frames
lynx_gbif_obs <- rbind(lynx_gbif_obs, lynx_pseudoabs)

Here we extract the climatic data for each of our observations using the function extract() from the package terra.
Explore the climate data for our presence and pseudoabsence data set.
decimalLatitude decimalLongitude PA ID
Min. :36.25 Min. :-9.421 Min. :0.0000 Min. : 1.0
1st Qu.:37.61 1st Qu.:-8.616 1st Qu.:0.0000 1st Qu.: 427.8
Median :38.07 Median :-6.935 Median :1.0000 Median : 854.5
Mean :38.69 Mean :-6.212 Mean :0.7073 Mean : 854.5
3rd Qu.:39.47 3rd Qu.:-4.220 3rd Qu.:1.0000 3rd Qu.:1281.2
Max. :43.64 Max. : 2.788 Max. :1.0000 Max. :1708.0
wc2.1_30s_bio_1 wc2.1_30s_bio_2 wc2.1_30s_bio_3 wc2.1_30s_bio_4
Min. : 3.683 Min. : 5.292 Min. :31.86 Min. :276.4
1st Qu.:14.365 1st Qu.: 9.306 1st Qu.:39.35 1st Qu.:410.8
Median :15.854 Median :10.542 Median :41.74 Median :555.4
Mean :15.068 Mean :10.684 Mean :41.66 Mean :538.0
3rd Qu.:16.258 3rd Qu.:12.108 3rd Qu.:44.48 3rd Qu.:650.0
Max. :18.396 Max. :14.783 Max. :47.77 Max. :769.1
NA's :36 NA's :36 NA's :36 NA's :36
wc2.1_30s_bio_5 wc2.1_30s_bio_6 wc2.1_30s_bio_7 wc2.1_30s_bio_8
Min. :20.20 Min. :-7.900 Min. :13.3 Min. :-1.767
1st Qu.:27.20 1st Qu.: 1.500 1st Qu.:21.4 1st Qu.: 9.400
Median :29.90 Median : 4.700 Median :26.5 Median :11.408
Mean :29.92 Mean : 4.018 Mean :25.9 Mean :10.745
3rd Qu.:32.80 3rd Qu.: 6.700 3rd Qu.:30.4 3rd Qu.:12.417
Max. :36.80 Max. : 9.500 Max. :36.4 Max. :18.033
NA's :36 NA's :36 NA's :36 NA's :36
wc2.1_30s_bio_9 wc2.1_30s_bio_10 wc2.1_30s_bio_11 wc2.1_30s_bio_12
Min. : 1.333 Min. :12.13 Min. :-3.217 Min. : 213.0
1st Qu.:20.367 1st Qu.:20.72 1st Qu.: 6.912 1st Qu.: 494.8
Median :21.833 Median :22.13 Median : 9.542 Median : 548.0
Mean :21.393 Mean :22.05 Mean : 8.936 Mean : 587.3
3rd Qu.:23.587 3rd Qu.:23.77 3rd Qu.:11.217 3rd Qu.: 599.0
Max. :26.433 Max. :26.43 Max. :13.350 Max. :1786.0
NA's :36 NA's :36 NA's :36 NA's :36
wc2.1_30s_bio_13 wc2.1_30s_bio_14 wc2.1_30s_bio_15 wc2.1_30s_bio_16
Min. : 30.00 Min. : 0.000 Min. :16.32 Min. : 84.0
1st Qu.: 70.75 1st Qu.: 2.000 1st Qu.:47.58 1st Qu.:188.0
Median : 92.00 Median : 4.000 Median :59.65 Median :250.0
Mean : 90.59 Mean : 7.928 Mean :56.25 Mean :246.1
3rd Qu.:102.00 3rd Qu.: 8.000 3rd Qu.:67.72 3rd Qu.:273.0
Max. :266.00 Max. :74.000 Max. :77.06 Max. :759.0
NA's :36 NA's :36 NA's :36 NA's :36
wc2.1_30s_bio_17 wc2.1_30s_bio_18 wc2.1_30s_bio_19 cell
Min. : 13 Min. : 17.00 Min. : 65.0 Min. : 19831
1st Qu.: 21 1st Qu.: 26.00 1st Qu.:171.0 1st Qu.: 774255
Median : 29 Median : 32.00 Median :229.0 Median :1027238
Mean : 41 Mean : 46.63 Mean :226.5 Mean : 915789
3rd Qu.: 41 3rd Qu.: 47.00 3rd Qu.:255.0 3rd Qu.:1111500
Max. :259 Max. :260.00 Max. :759.0 Max. :1356132
NA's :36 NA's :36 NA's :36
We perform a process called spatial thinning to keep only one presence or pseudoabsence point per raster climate cell. We will be using the spThin package.
pacman::p_load(spThin)
# spThin requires longitude and latitude coordinates, that we already have.
# It also requires a column with the species name
lynx_obs_clim$species <- 'Lynx pardinus'
# Remove adjacent cells of presence/background data:
xy <- thin(lynx_obs_clim, lat.col='decimalLatitude',long.col='decimalLongitude',
spec.col='species', thin.par=30, reps=1, write.files=F,
locs.thinned.list.return=T)

We select the maximum number of records after thinning (stored in the first element of the list xy).
We thin our data set, extracting the cell numbers for the thinned coordinates and using them to subset our data frame of presence and pseudoabsences.
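A self-contained sketch of that subsetting step, using simulated stand-ins for our data frame and for the list returned by thin() (which contains data frames with Longitude and Latitude columns); the matching by coordinate pair is a simplification of the cell-number approach described above:

```r
# Simulated stand-ins: a small occurrence table and a thin()-style result list
lynx_demo <- data.frame(
  decimalLongitude = c(-5.1, -6.2, -7.3, -4.0),
  decimalLatitude  = c(38.0, 39.5, 40.1, 37.2)
)
xy_demo <- list(data.frame(Longitude = c(-5.1, -7.3),
                           Latitude  = c(38.0, 40.1)))

# The thinned coordinates are in the first element of the list returned by thin()
xy_thin <- xy_demo[[1]]

# Keep only rows whose coordinate pair survived thinning
keep <- paste(lynx_demo$decimalLongitude, lynx_demo$decimalLatitude) %in%
        paste(xy_thin$Longitude, xy_thin$Latitude)
lynx_thinned_demo <- lynx_demo[keep, ]
```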
Spatially examine the thinned data set (presences in blue, pseudo-absences in red).
Before running our models, we should test for correlation among our variables and select those that are not correlated and are most relevant to our taxon. We will use the Spearman correlation test, as we do not know whether the data values are normally distributed.
# Running the correlation analysis for the climatic variables
cor_matrix <- cor(lynx_thinned[,-c(1:4, 24,25)], method='spearman')
# Set a correlation threshold (adjust as needed)
cor_threshold <- 0.7
# Perform hierarchical clustering to examine the hierarchical correlation between variables
dendrogram <- hclust(as.dist(1 - cor_matrix))

We plot the dendrogram.
Which variables should we select from each correlated cluster? Knowledge of the taxon is essential.
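A self-contained illustration of how the clusters can be read off the dendrogram: cutting the tree at a height of 1 - cor_threshold groups together variables whose correlation exceeds the threshold. The five simulated variables stand in for the bioclim columns:

```r
set.seed(7)
# Simulate five variables; v2 is strongly correlated with v1, and v4 with v3
v1 <- rnorm(200); v2 <- v1 + rnorm(200, sd = 0.1)
v3 <- rnorm(200); v4 <- v3 + rnorm(200, sd = 0.1)
v5 <- rnorm(200)
dat <- data.frame(v1, v2, v3, v4, v5)

cor_matrix <- cor(dat, method = "spearman")
cor_threshold <- 0.7
dendrogram <- hclust(as.dist(1 - cor_matrix))

# Variables in the same cluster are correlated above the threshold;
# we would pick one variable per cluster
clusters <- cutree(dendrogram, h = 1 - cor_threshold)
clusters
```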
In this case, we will keep it simple and select only Annual Mean Temperature (bio 1) and Annual Precipitation (bio 12).
# Filtering the relevant columns of the data
lynx_data <- lynx_thinned %>%
dplyr::select(species, PA, decimalLongitude, decimalLatitude, any_of(my_preds))
# Randomly split the data into training (70%) and testing (30%) sets
set.seed(1235)
train_i <- sample(seq_len(nrow(lynx_data)), size=round(0.7*nrow(lynx_data)))
# Subset the training and testing data
lynx_train <- lynx_data[train_i,]
lynx_test <- lynx_data[-train_i,]

We have the data ready to go! 🎉
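Next we fit a binomial GLM (logistic regression) of presence/pseudo-absence against the two selected variables; its summary output follows. As a self-contained sketch of such a fit (simulated data in place of lynx_train, with column names taken from the summary):

```r
set.seed(1)
# Simulated stand-in for lynx_train: presence/pseudo-absence plus two predictors
demo_train <- data.frame(
  PA = rbinom(200, 1, 0.7),
  wc2.1_30s_bio_1  = runif(200, 4, 18),     # annual mean temperature (°C)
  wc2.1_30s_bio_12 = runif(200, 200, 1800)  # annual precipitation (mm)
)

# Logistic regression of presence against the two selected bioclim variables
demo_glm <- glm(PA ~ wc2.1_30s_bio_1 + wc2.1_30s_bio_12,
                family = "binomial", data = demo_train)
summary(demo_glm)
```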
Call:
glm(formula = PA ~ wc2.1_30s_bio_1 + wc2.1_30s_bio_12, family = "binomial",
data = lynx_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.4631 -0.9091 -0.4536 1.0240 2.5428
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.4418710 1.6101371 -4.001 6.31e-05 ***
wc2.1_30s_bio_1 0.4409649 0.0970562 4.543 5.54e-06 ***
wc2.1_30s_bio_12 -0.0013064 0.0008209 -1.592 0.111
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 236.68 on 193 degrees of freedom
Residual deviance: 197.06 on 191 degrees of freedom
AIC: 203.06
Number of Fisher Scoring iterations: 5
We can use the output of our model to predict habitat suitability across the entire area of study. To achieve this, we use the function predict, passing the climatic variables and our model. The argument type = "response" returns the predicted probabilities from our model.
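A sketch of this step, assuming bioclim_data is the cropped SpatRaster of climatic variables and lynx_glm the fitted model (both object names are assumptions):

```r
# Predict habitat suitability (probability of presence) for every raster cell
suitability_map <- terra::predict(bioclim_data, lynx_glm, type = "response")
plot(suitability_map)
```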
This map shows the probability of occurrence of our species across the study area. This value ranges from 0 (unsuitable) to 1 (suitable).
Model evaluation and habitat prediction have been calculated using Jeff Oliver’s tutorial.
To obtain a binary map (suitable / unsuitable), we evaluate the model using the observations and pseudo-absences reserved for testing, with the function pa_evaluate() from the predicts package.
There are multiple metrics to evaluate SDMs. The AUC is a commonly used metric, although it has its limitations. It indicates how good the model is at distinguishing classes (in our case, presence from absence) and ranges from 0 to 1.
An AUC of 1 indicates that our model perfectly discriminates between areas where the species is present and areas where it is absent. An AUC of 0 is the complete opposite, and an AUC of 0.5 means that our model is no better than random.
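AUC also has a useful probabilistic reading: it equals the probability that a randomly chosen presence receives a higher predicted value than a randomly chosen absence. A self-contained illustration with simulated labels and scores (not our lynx data):

```r
set.seed(3)
labels <- c(rep(1, 40), rep(0, 60))   # 1 = presence, 0 = absence
scores <- ifelse(labels == 1,
                 rbeta(100, 4, 2),    # presences tend to score higher
                 rbeta(100, 2, 4))

pos <- scores[labels == 1]
neg <- scores[labels == 0]

# AUC as the scaled Mann-Whitney U statistic (ties counted as 0.5)
auc <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
auc
```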
np na prevalence auc cor pcor ODP
1 21 62 0.253012 0.8049155 0.447308 2.239462e-05 0.746988
What is the AUC of our model? What does this mean?
To do this, we need to identify the suitability threshold using the thresholds element from the function pa_evaluate(). We choose max_spec_sense, which sets “the threshold at which the sum of the sensitivity (true positive rate) and specificity (true negative rate) is highest.”
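An illustration of the final thresholding step, with simulated suitability values and a hypothetical threshold (in the practical, the threshold comes from pa_evaluate() as described above):

```r
set.seed(9)
suitability <- runif(20)   # stand-in for the predicted suitability values
threshold <- 0.42          # hypothetical value taken from the evaluation step

# Cells at or above the threshold are classed as suitable (1), the rest as 0
binary_map <- as.integer(suitability >= threshold)
binary_map
```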
This map is a categorical classification of whether the cell is suitable or not for the Iberian lynx using our selected threshold.